Audio Generation





ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Zhang, Mengchen, Chen, Qi, Wu, Tong, Liu, Zihan, Lin, Dahua

arXiv.org Artificial Intelligence

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
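For intuition, the core training signal behind conditional flow matching with a dual-branch predictor can be sketched as follows. The module names, tensor shapes, and shared-backbone wiring are illustrative assumptions, not ViSAudio's actual implementation.

```python
# Minimal sketch: conditional flow matching with two dedicated branches
# for the left/right audio latents; all names and shapes are hypothetical.
import torch
import torch.nn as nn

class DualBranchVelocityNet(nn.Module):
    """Two branches predict velocity fields for the binaural latents,
    sharing a video-conditioned backbone (assumed design)."""
    def __init__(self, latent_dim=64, cond_dim=512, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(2 * latent_dim + cond_dim + 1, hidden), nn.SiLU())
        self.left_head = nn.Linear(hidden, latent_dim)
        self.right_head = nn.Linear(hidden, latent_dim)

    def forward(self, x_t, video_cond, t):
        # x_t: (B, 2*latent_dim) noisy binaural latents, t: (B, 1) flow time.
        h = self.backbone(torch.cat([x_t, video_cond, t], dim=-1))
        return torch.cat([self.left_head(h), self.right_head(h)], dim=-1)

def flow_matching_loss(model, x0, x1, video_cond):
    """Standard conditional flow matching: regress the straight-line
    velocity (x1 - x0) at a random interpolation time t."""
    t = torch.rand(x0.size(0), 1)
    x_t = (1 - t) * x0 + t * x1          # linear interpolation path
    v_target = x1 - x0                    # constant velocity along the path
    v_pred = model(x_t, video_cond, t)
    return ((v_pred - v_target) ** 2).mean()
```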


BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation

Park, Taesoo, Jeong, Mungwi, Park, Mingyu, Kim, Narae, Kim, Junyoung, Kim, Mujung, Yoo, Jisang, Lee, Hoyun, Kim, Sanghoon, Kwon, Soonchul

arXiv.org Artificial Intelligence

This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Long-term audio generation is critical for applications in Text-to-Music (TTM) and Text-to-Audio (TTA) systems, where maintaining temporal coherence, prosodic consistency, and harmonic structure over extended durations remains a significant challenge. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including Multi-Scale Discriminator (MSD) + MED, MSD + MRD, and Multi-Period Discriminator (MPD) + MED + MRD, using objective metrics (Fréchet Audio Distance (FAD), Structural Similarity Index (SSIM), Pearson Correlation Coefficient (PCC), Mel-Cepstral Distortion (MCD)) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2.
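As background for the AMP module, the Snake activation it applies internally has the closed form Snake(x) = x + sin²(αx)/α. The per-channel learnable α and the module wrapper below are illustrative assumptions, not the BemaGANv2 code.

```python
# Sketch of the Snake periodic activation used inside AMP-style blocks;
# the wrapper and parameterization are illustrative only.
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation with a learnable per-channel frequency alpha."""
    def __init__(self, channels, alpha_init=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1), alpha_init))

    def forward(self, x):                      # x: (B, C, T) waveform features
        alpha = self.alpha.clamp(min=1e-4)     # keep the division well-defined
        return x + torch.sin(alpha * x) ** 2 / alpha
```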


TTMBA: Towards Text To Multiple Sources Binaural Audio Generation

He, Yuxuan, Yang, Xiaoran, Pan, Ningning, Huang, Gongping

arXiv.org Artificial Intelligence

Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting essential spatial information for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates multiple mono audios with varying durations for each event. These mono audios are transformed into binaural audios using a binaural rendering neural network based on spatial data from the LLM. Finally, the binaural audios are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
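The final arrangement step of the cascade, placing each rendered binaural event on a shared timeline by its start time, can be sketched roughly as follows; the data layout and function name are hypothetical, not the authors' code.

```python
# Illustrative sketch of arranging per-event binaural clips by start time.
import numpy as np

def arrange_binaural_events(events, sample_rate=16000):
    """events: list of dicts with 'audio' (2, n) binaural array and
    'start' time in seconds, e.g. produced by the rendering stage."""
    end = max(e["start"] + e["audio"].shape[1] / sample_rate for e in events)
    mix = np.zeros((2, int(np.ceil(end * sample_rate))))
    for e in events:
        s = int(round(e["start"] * sample_rate))
        mix[:, s:s + e["audio"].shape[1]] += e["audio"]  # overlap-add events
    return mix
```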


Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

Zhang, Kang, Pham, Trung X., Lee, Suyeon, Niu, Axi, Senocak, Arda, Chung, Joon Son

arXiv.org Artificial Intelligence

We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio
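A rough sketch of what a dual-role encoder could look like during training, conditioning the flow model while also providing targets for a feature-alignment term, is given below; every name, shape assumption, and the loss weighting are placeholders rather than MGAudio's actual objective.

```python
# Hedged sketch: one encoder output used both as conditioning and as an
# alignment target for the generator's hidden features (assumed design).
import torch
import torch.nn.functional as F

def training_step(flow_model, av_encoder, audio_latents, noise, video, lam=0.5):
    cond = av_encoder(video)                    # role 1: conditioning signal
    t = torch.rand(audio_latents.size(0), 1, 1)
    x_t = (1 - t) * noise + t * audio_latents   # flow-matching interpolant
    v_pred, hidden = flow_model(x_t, cond, t)   # model also exposes features
    fm_loss = F.mse_loss(v_pred, audio_latents - noise)
    # role 2: feature aligner (assumes hidden and cond share the feature dim)
    align_loss = 1 - F.cosine_similarity(hidden, cond.detach(), dim=-1).mean()
    return fm_loss + lam * align_loss
```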


UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

Tian, Jinchuan, Lee, Sang-gil, Kong, Zhifeng, Ghosh, Sreyan, Goel, Arushi, Yang, Chao-Han Huck, Dai, Wenliang, Liu, Zihan, Ye, Hanrong, Watanabe, Shinji, Shoeybi, Mohammad, Catanzaro, Bryan, Valle, Rafael, Ping, Wei

arXiv.org Artificial Intelligence

Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks - an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

Figure 1: Humans need understanding, generation, and reasoning to handle complex tasks, like composing music.

Human auditory intelligence is characterized by two fundamental capabilities: perception (understanding) and production (generation). This duality is not merely conceptual; neuro-scientific evidence reveals a profound synergy between these functions, where impairment in one often corresponds to a deficit in the other (Liberman et al., 1967; Hickok & Poeppel, 2007; Rizzolatti & Craighero, 2004). Furthermore, resolving complex acoustic challenges requires a sophisticated reasoning process that is inherently multimodal (McGurk & MacDonald, 1976; Leman, 2007; Denes & Pinson, 1993; Liberman & Mattingly, 1985).
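A minimal sketch of the discrete-token view described for UALM-Gen, a single decoder-only language model predicting audio codec tokens after a text prompt, is shown below; the vocabulary layout, model sizes, and class names are assumptions for illustration only.

```python
# Minimal sketch of a unified text+audio token LM; sizes are hypothetical.
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB = 32000, 8192          # assumed vocabulary sizes

class UnifiedTokenLM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, TEXT_VOCAB + AUDIO_VOCAB)

    def forward(self, tokens):                  # tokens: (B, T) text then audio ids
        T = tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.decoder(self.embed(tokens), mask=causal)
        return self.head(h)                     # next-token logits over both vocabs
```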